
is the reconstruction error that is assumed to obey a Gaussian prior with zero mean and variance ν. Under the most probable y (corresponding to y = 0 and $x = w^{-1}\hat{x}$, i.e., the minimum reconstruction error), we maximize p(x|y) to optimize x for quantization (e.g., 1-bit CNNs) as:

$$
\max \; p(x \mid y),
\tag{3.98}
$$

which can be solved based on Bayesian learning, which uses Bayes' theorem to determine the conditional probability of a hypothesis given limited observations. We note that the calculation of BNNs is still based on optimizing x, as shown in Fig. 3.19, where the binarization is performed with the sign function. Equation 3.98 is complicated and difficult to solve due to the unknown $w^{-1}$, as shown in Eq. 3.97. From a Bayesian learning perspective, we resolve this problem via maximum a posteriori (MAP) estimation:

$$
\max \; p(x \mid y) = \max \; p(y \mid x)\,p(x) = \min \; \|\hat{x} - wx\|_2^2 - 2\nu \log p(x),
\tag{3.99}
$$

where

$$
p(y \mid x) \propto \exp\!\left(-\frac{1}{2\nu}\|y\|_2^2\right) = \exp\!\left(-\frac{1}{2\nu}\|\hat{x} - wx\|_2^2\right).
\tag{3.100}
$$
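To see how the minimization form in Eq. 3.99 arises, take the negative logarithm of p(y|x)p(x), scale it by 2ν, and drop the terms that do not depend on x:

$$
\max \; p(y \mid x)\,p(x)
\;\Longleftrightarrow\;
\min \; -2\nu \log\big[p(y \mid x)\,p(x)\big]
\;=\;
\min \; \|\hat{x} - wx\|_2^2 - 2\nu \log p(x),
$$

where the quadratic term follows from substituting the Gaussian likelihood of Eq. 3.100.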

In Eq. 3.100, we assume that all components of the quantization error y are i.i.d., thus

resulting in a simplified form. As shown in Fig. 3.19, for 1-bit CNNs, x is usually quantized

to two numbers with the same absolute value. We neglect the overlap between the two

numbers, and thus p(x) is modeled as a Gaussian mixture with two modes:

$$
\begin{aligned}
p(x) ={} & \frac{1}{2(2\pi)^{\frac{N}{2}} \det(\Psi)^{\frac{1}{2}}}
\left\{ \exp\!\left(-\frac{(x-\mu)^{T}\Psi^{-1}(x-\mu)}{2}\right)
      + \exp\!\left(-\frac{(x+\mu)^{T}\Psi^{-1}(x+\mu)}{2}\right) \right\} \\
\approx{} & \frac{1}{2(2\pi)^{\frac{N}{2}} \det(\Psi)^{\frac{1}{2}}}
\left\{ \exp\!\left(-\frac{(x_{+}-\mu_{+})^{T}\Psi_{+}^{-1}(x_{+}-\mu_{+})}{2}\right)
      + \exp\!\left(-\frac{(x_{-}+\mu_{-})^{T}\Psi_{-}^{-1}(x_{-}+\mu_{-})}{2}\right) \right\},
\end{aligned}
\tag{3.101}
$$

where x is divided into $x_{+}$ and $x_{-}$ according to the signs of the elements in x, and N is the dimension of x. Accordingly, Eq. 3.99 can be rewritten as:

$$
\min \; \|\hat{x} - wx\|_2^2
+ \nu\,(x_{+}-\mu_{+})^{T}\Psi_{+}^{-1}(x_{+}-\mu_{+})
+ \nu\,(x_{-}+\mu_{-})^{T}\Psi_{-}^{-1}(x_{-}+\mu_{-})
+ \nu \log\!\big(\det(\Psi)\big),
\tag{3.102}
$$

where $\mu_{-}$ and $\mu_{+}$ are solved independently, and $\det(\Psi)$ is accordingly set to be the determinant of the matrix $\Psi_{-}$ or $\Psi_{+}$. We call Eq. 3.102 the Bayesian kernel loss.
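As a concrete illustration, the following PyTorch-style sketch evaluates the Bayesian kernel loss of Eq. 3.102 for a single kernel vector. It assumes a diagonal covariance Ψ (stored as a vector of per-element variances), a scalar reconstruction scale w, and scalar mode means; the function and variable names (`bayesian_kernel_loss`, `psi_pos`, etc.) are illustrative and not taken from the original implementation.

```python
import torch

def bayesian_kernel_loss(x, x_hat, w, mu_pos, mu_neg, psi_pos, psi_neg, nu):
    """Sketch of the Bayesian kernel loss (Eq. 3.102) for one kernel vector.

    x        : full-precision kernel weights, shape (N,)
    x_hat    : binarized kernel (e.g., sign(x) with a scale), shape (N,)
    w        : scalar reconstruction scale (assumed scalar here)
    mu_pos   : scalar mean of the positive mode, mu_+
    mu_neg   : scalar mean (magnitude) of the negative mode, mu_-
    psi_pos  : diagonal of Psi_+ (per-element variances), shape (N,)
    psi_neg  : diagonal of Psi_- (per-element variances), shape (N,)
    nu       : variance of the Gaussian prior on the reconstruction error
    """
    # Reconstruction term ||x_hat - w * x||_2^2
    recon = torch.sum((x_hat - w * x) ** 2)

    # Split x into x_+ and x_- by the signs of its elements
    pos_mask = x >= 0
    neg_mask = ~pos_mask

    # nu * (x_+ - mu_+)^T Psi_+^{-1} (x_+ - mu_+), with diagonal Psi_+
    prior_pos = nu * torch.sum((x[pos_mask] - mu_pos) ** 2 / psi_pos[pos_mask])

    # nu * (x_- + mu_-)^T Psi_-^{-1} (x_- + mu_-), with diagonal Psi_-
    prior_neg = nu * torch.sum((x[neg_mask] + mu_neg) ** 2 / psi_neg[neg_mask])

    # nu * log(det(Psi)); as an illustrative choice we add the log-determinants
    # of both diagonal blocks, i.e., the sum of the corresponding log-variances
    log_det = nu * (torch.log(psi_pos[pos_mask]).sum()
                    + torch.log(psi_neg[neg_mask]).sum())

    return recon + prior_pos + prior_neg + log_det
```

In a full training pipeline, such a term would typically be accumulated over all binarized kernels and added to the task loss with a trade-off hyperparameter.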

Bayesian feature loss: We also design a Bayesian feature loss to alleviate the disturbance caused by the extreme quantization process in 1-bit CNNs. Considering the intraclass compactness, the features $f_m$ of the m-th class are assumed to follow a Gaussian distribution with mean $c_m$, as revealed in the center loss [245]. Similar to the Bayesian kernel loss, we define $y_f^m = f_m - c_m$ with $y_f^m \sim \mathcal{N}(0, \sigma_m)$, and we have:

$$
\min \; \|f_m - c_m\|_2^2
+ \sum_{n=1}^{N_f} \Big[ \sigma_{m,n}^{-2}\,(f_{m,n} - c_{m,n})^2 + \log\!\big(\sigma_{m,n}^{2}\big) \Big],
\tag{3.103}
$$

which is called the Bayesian feature loss. In Eq. 3.103, $\sigma_{m,n}$, $f_{m,n}$, and $c_{m,n}$ are the n-th elements of $\sigma_m$, $f_m$, and $c_m$, respectively. We take the latent distributions of kernel weights and features into consideration in the same framework and introduce Bayesian losses to improve the capacity of 1-bit CNNs.
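The feature loss of Eq. 3.103 is likewise straightforward to express in code. The PyTorch-style sketch below computes it for a mini-batch of features with known class labels; the names `bayesian_feature_loss`, `centers`, and `log_sigma_sq`, and the choice to parameterize the variances in log space as learnable per-class quantities, are illustrative assumptions rather than the original implementation.

```python
import torch

def bayesian_feature_loss(features, labels, centers, log_sigma_sq):
    """Sketch of the Bayesian feature loss (Eq. 3.103) over a mini-batch.

    features     : batch of feature vectors, shape (B, N_f)
    labels       : class index of each sample, shape (B,)
    centers      : per-class feature means c_m, shape (M, N_f)
    log_sigma_sq : per-class log-variances log(sigma_{m,n}^2), shape (M, N_f)
                   (log-space parameterization keeps the variance positive)
    """
    c = centers[labels]              # c_m for each sample, (B, N_f)
    log_var = log_sigma_sq[labels]   # log(sigma^2) for each sample, (B, N_f)
    diff_sq = (features - c) ** 2    # (f_{m,n} - c_{m,n})^2

    # ||f_m - c_m||_2^2  (center-loss-style term)
    center_term = diff_sq.sum(dim=1)

    # sigma^{-2} (f - c)^2 + log(sigma^2), summed over the N_f dimensions
    bayes_term = (diff_sq * torch.exp(-log_var) + log_var).sum(dim=1)

    # Average over the batch; a trade-off weight would normally be applied
    return (center_term + bayes_term).mean()
```

As with the center loss [245], the class centers (and here the log-variances) would typically be learned jointly with the network parameters.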